Yelp Phoenix Dataset

In this notebook, the data of the Yelp Phoenix Academic Dataset will be analyzed to give us an insight into what experiments can be run using it.

Businesses

The first dataset we will analyze contains information about the businesses. The businesses are stored in the file yelp_academic_dataset_business.json. There are 15,585 businesses.

Fields

Each business contains 15 fields which are explained below:

  • type: Contains the type of the data. All rows in the yelp_academic_dataset_business.json file have the value business in the type field.
  • business_id: Contains an encrypted business id.
  • name: The name of the business.
  • neighborhoods: It's supposed to list the neighborhoods the business belongs to, but for all rows this field contains an empty array.
  • full_address: The full address of the business.
  • city: The city where the business is located.
  • state: The state where the business is located. Of the 15,585 businesses, only 3 are located outside Arizona.
  • latitude: The latitude of the business.
  • longitude: The longitude of the business.
  • stars: The star rating that users have given to this business. The rating is rounded to half-stars.
  • review_count: The number of reviews this business has received.
  • categories: An array with the categories to which this business belongs. For example, [Restaurant, Bar, Mexican].
  • open: I think this indicates whether the business is still operating.
  • hours: A dictionary with the opening hours of the business for each day of the week.
  • attributes: A dictionary with additional information about the business. For example 'Accepts credit cards', 'Delivery', 'Price range', 'Parking', etc.
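
A quick way to confirm the fields listed above is to load the first record of the file and print its keys (a minimal sketch; it only uses the standard library's json module):


In [ ]:
import json

with open('yelp_academic_dataset_business.json') as input_file:
    first_business = json.loads(input_file.readline())

# The field names described in the list above
print(sorted(first_business.keys()))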

Data analysis

We will do some coding to analyze the data.


In [1]:
import json
from pandas import DataFrame

business_file = 'yelp_academic_dataset_business.json'
business_records = [json.loads(line) for line in open(business_file)]
business_data_frame = DataFrame(business_records)

# Count how many businesses there are for each star rating
column = 'stars'
business_counts = business_data_frame.groupby(column).size()

# print(business_counts)

In [2]:
business_counts.plot(kind='bar', rot=0)


Out[2]:
<matplotlib.axes.AxesSubplot at 0xc8b9510>

To speed things up, I have created a couple of functions that automatically plot the data when you pass the JSON file name.


In [3]:
import matplotlib.pyplot as plt


def plot_json_file(file_path, column, plot_type='line', title=None,
                   x_label=None, y_label=None, show_total=True,
                   show_range=False, y_scale='linear'):
    """
    Creates a DataFrame object from a JSON file and plots the counts of the
    data grouped by the given column, annotating the figure with the mean,
    median, standard deviation and, if requested, the sum of all the values
    and a range with the minimum and maximum values

    @param file_path: the absolute path of the JSON file that contains the
    data
    @param column: the column which will be used to group and count the data
    @param plot_type: the type of graph. For example 'bar', 'barh', 'line',
    etc.
    @param title: the title of the graph
    @param x_label: the label for the x axis
    @param y_label: the label for the y axis
    @param show_total: a boolean which indicates if the sum of all the
    values should be displayed on the graph
    @param show_range: a boolean which indicates if the minimum and maximum
    values should be displayed on the graph
    @param y_scale: the scale of the y axis, for example 'linear' or 'log'
    """
    records = [json.loads(line) for line in open(file_path)]

    # Insert all the records into a pandas DataFrame
    data_frame = DataFrame(records)
    plot_data(data_frame, column, plot_type, title,
              x_label, y_label, show_total, show_range, y_scale)


def plot_data(data_frame, column, plot_type='line', title=None,
              x_label=None, y_label=None, show_total=True,
              show_range=False, y_scale='linear'):

    # Group the data by the given column and compute the summary statistics
    counts = data_frame.groupby(column).size()
    mean = data_frame.mean()[column]
    std = data_frame.std()[column]
    median = data_frame.median()[column]

    label = 'mean=' + str(mean) + '\nmedian=' + str(
        median) + '\nstd=' + str(std)

    if show_total:
        total = data_frame.sum()[column]
        label = label + '\ntotal=' + str(total)

    if show_range:
        min_value = data_frame.min()[column]
        max_value = data_frame.max()[column]
        label = label + '\nrange=[' + str(min_value) + ', ' + str(
            max_value) + ']'

    fig, ax = plt.subplots(1)

    counts.plot(kind=plot_type, rot=0)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.set_yscale(y_scale)

    # these are matplotlib.patch.Patch properties
    properties = dict(boxstyle='round', facecolor='wheat', alpha=0.95)

    # Place the summary statistics in a text box in the upper left corner
    ax.text(0.05, 0.95, label, fontsize=14, transform=ax.transAxes,
            verticalalignment='top', bbox=properties)

If we call this function we can obtain the same results (or even better ones!).


In [4]:
plot_json_file('yelp_academic_dataset_business.json', 'stars', 'bar',
          'Businesses\' ratings', 'Rating', 'Number of places', False)


As can be seen in the graph above, most of the businesses have an average rating of either 3.5 or 4 stars. Very few businesses have bad ratings.

Now we will continue to analyze the businesses' data but this time looking at the number of reviews that each business has.


In [5]:
plot_json_file(business_file, 'review_count', 'line', 'Reviews per business',
               'Review count', 'Frequency', True, True, 'log')


We can see that the great majority of businesses have very few reviews (as shown by the median). On average, each business has around 23 reviews. The business with the fewest reviews has 3, and the business with the most reviews has 1170.

Reviews

This is the biggest dataset in the Yelp data challenge; it is stored in the file yelp_academic_dataset_review.json. There are 335,022 reviews.

Fields

Each review contains 7 fields which are explained below:

  • type: Contains the type of the data. All rows in the yelp_academic_dataset_review.json file have the value review in the type field.
  • business_id: Contains an encrypted business id.
  • user_id: Contains an encrypted user id.
  • stars: The rating awarded in the review.
  • text: The text for the review.
  • date: The date of the review in 'yyyy-mm-dd' format.
  • votes: A dictionary with the votes that the review has received. There are three categories for the votes: 'cool', 'funny' and 'useful'.

Data analysis

User ratings
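
The distribution of the review ratings can be plotted with the helper function defined earlier (a sketch cell; it groups the reviews by their 'stars' field, just as we did for the businesses):


In [ ]:
plot_json_file('yelp_academic_dataset_review.json', 'stars', 'bar',
               'Reviews\' ratings', 'Rating', 'Number of reviews', False)
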

As can be seen in the graph above, the majority of user ratings are 4 and 5 stars, and there is a huge gap between 3-star and 4-star ratings. One could say one of two things: users usually don't rate bad places, or there are fewer bad places than good places.

Some useful stats are:

  • Average rating: 3.770364687
  • Median: 4
  • Standard deviation: 1.259828763
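
These values can be computed directly from the 'stars' column (a sketch; review_records and review_data_frame are names introduced here for the reviews loaded into a pandas DataFrame):


In [ ]:
review_records = [json.loads(line) for line in open('yelp_academic_dataset_review.json')]
review_data_frame = DataFrame(review_records)

# Summary statistics of the review ratings
print(review_data_frame['stars'].mean())
print(review_data_frame['stars'].median())
print(review_data_frame['stars'].std())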

Number of reviews per user

  • Average number of reviews per user: 4.730813223
  • Median: 1
  • Standard deviation: 15.81852856
  • User with the least number of reviews: 1
  • User with the most number of reviews: 774
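
The figures above can be reproduced by grouping the reviews by user and counting (a sketch that reuses the review_data_frame from the previous cell):


In [ ]:
# Number of reviews written by each user in this dataset
reviews_per_user = review_data_frame.groupby('user_id').size()

print(reviews_per_user.mean())
print(reviews_per_user.median())
print(reviews_per_user.std())
print(reviews_per_user.min())
print(reviews_per_user.max())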

Users

This dataset contains information about 70,817 users, which are stored in the yelp_academic_dataset_user.json file.

Fields

Each user contains 11 fields which are explained below:

  • type: Contains the type of the data. All rows in the yelp_academic_dataset_user.json file have the value user in the type field.
  • user_id: Contains an encrypted user id.
  • name: Contains the first name of the user.
  • review_count: The number of reviews this user has made.
  • average_stars: The average rating of this user.
  • votes: A dictionary with the number of votes this user has made, for each type of vote. There are three categories for the votes: 'cool', 'funny' and 'useful'.
  • friends: A list with the user's friends.
  • elite: A list with the years that this user has been elite. 93% of the users have an empty list in this field.
  • yelping_since: The date this user joined yelp in 'yyyy-mm-dd' format.
  • compliments: A dictionary with the number of votes this user has received, for each type of vote.
  • fans: The number of fans this user has.

Data analysis

We analyze the data using the functions we created earlier.


In [6]:
plot_json_file('yelp_academic_dataset_user.json', 'review_count', 'line', 'Reviews per user',
               'Review count', 'Frequency', True, True, 'log')


As can be seen in the graph above, the numbers from the users' review counts and the reviews dataset don't seem to match. This is probably because the users in the user dataset have also written reviews in other places, not just in Phoenix, while the reviews dataset only contains reviews made for businesses in Phoenix.
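
One way to check this is to compare the total review_count reported in the user file with the number of reviews actually present in the review file (a sketch; user_records and user_data_frame are names introduced here):


In [ ]:
user_records = [json.loads(line) for line in open('yelp_academic_dataset_user.json')]
user_data_frame = DataFrame(user_records)

# Reviews the users report having written anywhere on Yelp
print(user_data_frame['review_count'].sum())

# Reviews actually included in this dataset (Phoenix businesses only)
print(sum(1 for line in open('yelp_academic_dataset_review.json')))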

Check-in

This dataset contains information about 11,434 check-ins, which are stored in the yelp_academic_dataset_checkin.json file.

Fields

Each check-in contains 3 fields which are explained below:

  • type: Contains the type of the data. All rows in the yelp_academic_dataset_checkin.json file have the value checkin in the type field.
  • business_id: Contains an encrypted business id.
  • checkin_info: Contains a dictionary with the number of checkins for each hour and each day of the week.

Data analysis

We analyze the number of check-ins per hour and per day of the week.

There are a total of 1,457,303 check-ins for 11,434 businesses, which gives us an average of around 127 check-ins per business, and around 18 check-ins per business per day of the week.
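
A sketch of how these totals can be obtained from the checkin_info dictionaries (assuming, as described above, that each value in checkin_info is the number of check-ins for one hour/day-of-week slot):


In [ ]:
checkin_records = [json.loads(line) for line in open('yelp_academic_dataset_checkin.json')]

# Each checkin_info value is the check-in count for one hour/day-of-week slot,
# so summing all the values gives the total number of check-ins
total_checkins = sum(sum(record['checkin_info'].values())
                     for record in checkin_records)

print(total_checkins)
print(total_checkins / float(len(checkin_records)))      # check-ins per business
print(total_checkins / float(len(checkin_records)) / 7)  # per business per day of the week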

Tip

This dataset contains information about 113,993 tips, which are stored in the yelp_academic_dataset_tip.json file. A tip seems very similar to a review without a rating.

Fields

Each tip contains 6 fields which are explained below:

  • type: Contains the type of the data. All rows in the yelp_academic_dataset_tip.json file have the value tip in the type field.
  • text: The text for the tip.
  • business_id: Contains an encrypted business id.
  • user_id: Contains an encrypted user id.
  • date: The date of the tip in 'yyyy-mm-dd' format.
  • likes: The number of likes this tip has received.
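
The same helper function can be applied to this file as well, for example to look at how many likes the tips receive (a sketch; it simply reuses plot_json_file on the 'likes' field documented above):


In [ ]:
plot_json_file('yelp_academic_dataset_tip.json', 'likes', 'line', 'Likes per tip',
               'Likes', 'Frequency', True, True, 'log')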